General Introduction

Hello and welcome to this short documentation of my forecasting project, which was carried out in the context of a forecasting theory seminar.

This forecasting project is about predicting the total number of taxi rides in Queens, New York (over a time span of roughly a month, on a daily basis).

Here is a map of Queens and its districts:

From: NYC.gov

One could decompose this forecasting/prediction problem by district and sum the individual forecasts up at the end. Decomposition/segmentation is one of the forecasting principles (Green, Graefe, & Armstrong, 2011), so I’ll try to dissect the problem into several smaller ones. Looking through the raw data, the JFK Airport (zone 132) and the LaGuardia Airport (zone 138) account for most taxi rides, whereas each of the other zones is almost negligible on its own. Summing all the remaining zones, however, yields a meaningful third “region”. So I will forecast the total number of taxi rides per day by summing up the predicted rides for these 3 regions.

But let’s import all the data first.

Getting the data

Notes for the dataset

The following .rds files are the results of several cleaning and importing steps. Because the dataset itself was patched together from over 133 Parquet files (just a different format for storing the data), which takes time to read, I decided to store the results in .rds files so they are easier to access without having to rerun the same code in every session.
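The caching idea can be sketched in base R as follows (the cache path and the placeholder build step are hypothetical; the real project reads the Parquet files, e.g. with the arrow package):

```r
# Cache pattern sketch: build the dataset once, save it as .rds,
# and load the cached copy on every later run.
cache_file <- file.path(tempdir(), "taxi_rides_cache.rds")  # hypothetical path

if (file.exists(cache_file)) {
  taxi_rides <- readRDS(cache_file)
} else {
  # Placeholder for the expensive step: reading the ~133 Parquet files
  # (e.g. with arrow::read_parquet()) and binding them together.
  taxi_rides <- data.frame(rides = c(100L, 200L))
  saveRDS(taxi_rides, cache_file)
}
```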

The data was collected from several sources. Yellow-taxi-related data comes from NYC.gov, whereas weather-related data comes from the National Weather Service. As a limitation from the start, the weather data was gathered at LaGuardia Airport (so one of the locations we will be forecasting for). Whether this data holds for the other zones is not known, but I would argue that it is less reliable (and therefore less valid) for them.

As the number of taxi rides was also influenced by the COVID-19 pandemic, I searched the internet and found a timeline of government responses (and thus also the lockdown dates and other restrictions). As it is not in a usable data format, the only thing I can do is hardcode it in R as a vector and append it to the dataframe. The variable will take one of the following values:

  • 5 (starting from the 7th of March 2020, when the state of emergency was declared)
  • 4 (after the 8th of June 2020, when Phase 1 of the reopening plan began)
  • 3 (after the 22nd of June, when Phase 2 began)
  • 2 (after the 6th of July, when Phase 3 began)
  • 1 (after the 20th of July, when Phase 4 began)
  • 0 (after the 21st of May 2021, when the rules for wearing face masks and social distancing were at an all-time low).

This will be a limitation as well, of course.

Importing and Cleaning the data

#read in data
taxi_rides <- readRDS("cleaned_datasets\\taxi_rides.rds")
taxi_zone <- readRDS("cleaned_datasets\\taxi_zone.rds")
max_rain <- readRDS("cleaned_datasets\\max_rain_LG_Airport.rds")
max_temp <- readRDS("cleaned_datasets\\max_temp_LG_Airport.rds")

#join temp and rain dataset, and clean them accordingly
weather_data <- max_rain %>% 
   full_join(max_temp, by = c("Jahr", "Monat", "Tag")) %>% 
   mutate(Monat = as.numeric(Monat)) %>% 
   select(-Max_rain_inches, -Max_Temp)

# February 2024 has only a few datapoints, so I'll remove everything after 31.01.2024
taxi_rides %>% 
   group_by(Year_PU, Month_PU) %>% 
   summarize(count_Days = n_distinct(Day_PU)) %>% 
   paged_table(options = list(rows.print = 12))
#remove days where the year is 2024 and the month is 2 (not enough data points to work with; I'll forecast for January instead)
remove_rows <- which(taxi_rides$Year_PU == 2024 & taxi_rides$Month_PU == 2)
taxi_rides <- taxi_rides[-remove_rows, ]

remove_rows <- which(taxi_zone$Year_PU == 2024 & taxi_zone$Month_PU == 2)
taxi_zone <- taxi_zone[-remove_rows, ]

full_ride <- taxi_rides %>% 
   left_join(weather_data, by = c("Year_PU" = "Jahr",
                                  "Month_PU" = "Monat",
                                  "Day_PU" = "Tag"))

full_zone <- taxi_zone %>% 
   left_join(weather_data, by = c("Year_PU" = "Jahr",
                                  "Month_PU" = "Monat",
                                  "Day_PU" = "Tag"))


#look whether there are NA's created
sum(is.na(full_ride)); sum(is.na(full_zone))

full_zone2 <- full_zone %>% group_by(PULocation, Year_PU, Month_PU, Day_PU, Max_Temp_C, Max_rain_mm) %>% 
   summarise(rides_day = sum(rides)) %>% 
   ungroup() %>% 
   mutate(date = as.Date(paste(Year_PU, Month_PU, Day_PU, sep = "-")))%>% 
   select(-c(Year_PU, Month_PU, Day_PU))

full_ride2 <- full_ride %>% group_by(Year_PU, Month_PU, Day_PU, Max_Temp_C, Max_rain_mm)  %>% 
   summarise(rides_day = sum(rides)) %>% 
   ungroup() %>% 
   mutate(date = as.Date(paste(Year_PU, Month_PU, Day_PU, sep = "-"))) %>% 
   select(-c(Year_PU, Month_PU, Day_PU))

# create vector where we have the "lockdown" data as specified in the previous section
lockdown_dat <- data.frame(date = full_ride2$date) %>% 
   mutate(lock_down = case_when(
    between(date, as.Date("2020-03-07"), as.Date("2020-06-07")) ~ 5,
    between(date, as.Date("2020-06-08"), as.Date("2020-06-22")) ~ 4,
    between(date, as.Date("2020-06-23"), as.Date("2020-07-06")) ~ 3,
    between(date, as.Date("2020-07-07"), as.Date("2020-07-20")) ~ 2,
    between(date, as.Date("2020-07-21"), as.Date("2021-05-21")) ~ 1,
    TRUE ~ 0
  ))

full_ride2 <- full_ride2 %>% 
   full_join(lockdown_dat, by = "date")

full_zone2 <- full_zone2 %>% 
   full_join(lockdown_dat, by = "date")

Forecasting Project

Let’s start by turning the dataframes into tsibbles (time series tibbles):

zone <- full_zone2 %>% as_tsibble(key = PULocation, index = date)

rm(list = setdiff(ls(),c("zone", "full_ride2")))

zone
## # A tsibble: 12,144 x 6 [1D]
## # Key:       PULocation [3]
##    PULocation  Max_Temp_C Max_rain_mm rides_day date       lock_down
##    <chr>            <dbl>       <dbl>     <int> <date>         <dbl>
##  1 JFK Airport       5              0      7898 2013-01-01         0
##  2 JFK Airport       1.11           0      9898 2013-01-02         0
##  3 JFK Airport       1.11           0      8773 2013-01-03         0
##  4 JFK Airport       3.89           0      7565 2013-01-04         0
##  5 JFK Airport       6.11           0      7797 2013-01-05         0
##  6 JFK Airport       8.33           0      8305 2013-01-06         0
##  7 JFK Airport       7.22           0      9019 2013-01-07         0
##  8 JFK Airport       9.44           0      7128 2013-01-08         0
##  9 JFK Airport      10.6            0      6846 2013-01-09         0
## 10 JFK Airport       8.89           0      7183 2013-01-10         0
## # ℹ 12,134 more rows

We have data ranging from 2013-01-01 to 2024-01-31, so 4048 days in total.
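As a quick sanity check, the day count can be recomputed directly from the two dates:

```r
# Number of calendar days from 2013-01-01 to 2024-01-31, inclusive.
n_days <- as.integer(as.Date("2024-01-31") - as.Date("2013-01-01")) + 1
n_days  # 4048
```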

Descriptive Plots

In this section, we will simply plot our metric of interest, taxi rides on a given day, over time.
Two views are displayed: the number of taxi rides split by the 3 zones, and the total taxi rides (which is just the 3 zones summed up).
In both cases, we can see that the number of taxi rides dipped at the beginning of 2020, when the pandemic hit the USA. Overall, there is also a slight downward trend from 2013 to the beginning of 2020, and a slight increase after 2022.

Taxi Rides Per Zone

Here is a plot where we can see the total amount of rides per zone:

zone %>% autoplot(rides_day, alpha = .5)+
   labs(y = "Taxi Rides")

Total Taxi Rides

Here is a plot which gives us a feel for the total number of taxi rides over time (this is what we want to forecast):

total <- zone %>% index_by(date) %>% 
   summarize(Total= sum(rides_day))

total %>% autoplot(Total)+
   labs(y = "Total Taxi Rides")+
   theme_minimal()

Box-Cox transformed total rides

If we want to have similar variances in each time period (and thus similar seasonal patterns over time), we can use transformations to achieve that goal. The Box-Cox transformation mixes logarithms and power transformations and depends on \(\lambda\). We can use the features() function to search for the optimal \(\lambda\)-value for our case.
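For intuition, the classic one-parameter Box-Cox transform can be written out by hand (fabletools::box_cox() behaves like this for positive data; it uses a slight modification around zero and for negative values):

```r
# Box-Cox: log for lambda = 0, scaled power transform otherwise.
box_cox_manual <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

box_cox_manual(100, 1)     # lambda = 1: only a shift, gives 99
box_cox_manual(exp(2), 0)  # lambda = 0: natural log, gives 2
```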

lambda <- total %>% 
   features(Total, features = guerrero) %>% 
   pull(lambda_guerrero)

total <- total %>% 
   mutate(ride_tr = box_cox(Total, lambda))

total %>% 
   autoplot(ride_tr)+
   labs(y = "Transformed Total Taxi Rides")+
   theme_minimal()

As this plot seems to have better properties, we will use the transformed data from here on out, but will transform back where needed so the results stay interpretable.
The only remaining concern is that the data still contains large “outliers”, where the number of taxi rides dips sharply and then returns to normal. Maybe these were caused by malfunctions in data collection for most taxis?

Descriptive Features

Instead of only plotting what we have, one can also describe the data with some descriptive statistics. Here is a table consisting of the 3 different zones and simple statistics:

zone %>% 
   features(rides_day, list(mean = mean, sd = sd, median = median, min = min, max = max))
## # A tibble: 3 × 6
##   PULocation          mean    sd median   min   max
##   <chr>              <dbl> <dbl>  <dbl> <int> <int>
## 1 JFK Airport        6293. 2520.  6917     16 12140
## 2 La Guardia Airport 6307. 3698.  6152.     7 17407
## 3 Other              3094. 1927.  2928     62 11247

So during the COVID pandemic, the minimum number of taxi rides on a single day was 7.

zone %>% 
   features(rides_day, quantile)
## # A tibble: 3 × 6
##   PULocation          `0%` `25%` `50%` `75%` `100%`
##   <chr>              <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 JFK Airport           16 5209  6917  8024.  12140
## 2 La Guardia Airport     7 3222. 6152. 9497.  17407
## 3 Other                 62 1286  2928  4300.  11247

Further Exploratory Plots

In the following series of plots, we will examine whether there are underlying trends and seasonality in our data. Plotting is a useful tool for grasping your data, as opposed to scrolling through the whole data table.

Season Plots for a year by zones

In this plot, we can see the yearly pattern for each of the 3 zones. As mentioned previously, taxi rides decrease with each additional year. Interestingly, the two airports were only affected from early 2020 until the end of 2021, whereas the other zones needed longer to recover to pre-lockdown levels.

zone %>% 
   gg_season(rides_day, labels = "right", alpha = .5)+
   theme_minimal()+
   labs(y= "Taxi Rides")

Season Plots for a year

Again, we can see the decline in taxi rides with each additional year, before the series settles at an equilibrium of around 100’000 taxi rides per day during 2022 and 2023.

total %>%  
   gg_season(ride_tr, labels = "right", period = "year")+
   labs(y = "Transformed Total Taxi Rides",
        title = TeX(paste0("Box Cox Transformed Taxi Rides with $\\lambda$ = ", round(lambda, 2))))+
   theme_minimal()

Lagplots

Lagplots per year and day

This is a lag plot, where each lag corresponds to a shift of one year. We should see around 365 points in each panel: the y-axis shows the total taxi rides on a day, whereas the x-axis shows the same quantity lagged by lag_x years. Each point therefore pairs a day with the same day x years earlier. If the points all aligned on the diagonal, we would have a perfect correlation (and forecasting would be easy, as knowing the quantity at one point in time would tell us the quantity x years later). We can also see that higher lag numbers come with more spread, which makes sense, as correlations over a greater time span are lower than those that are more proximate. In other words, time series data from several years back is less reliable for forecasting than data from one year back.

total %>% 
   gg_lag(y = ride_tr, lags = 1:16, period = "year", geom = "point")+
   theme_minimal()+
   theme(legend.position = "none")

Or a lagplot for different weekdays:

total %>% 
   gg_lag(y = ride_tr, lags = 1:16, geom = "point", period = "weeks")+
   theme(legend.position = "none")+
   theme_minimal()

In this case, all the weekdays correlate strongly, but we can also see the differently colored “lines” in the plot, as the same weekdays correlate even more strongly with each other.

Lagged autocorrelations

These are the correlations for each lag, for those who are interested:

total %>% 
   ACF(y =  Total, lag_max = 1000)
## # A tsibble: 1,000 x 2 [1D]
##         lag   acf
##    <cf_lag> <dbl>
##  1       1D 0.934
##  2       2D 0.901
##  3       3D 0.924
##  4       4D 0.922
##  5       5D 0.895
##  6       6D 0.910
##  7       7D 0.944
##  8       8D 0.908
##  9       9D 0.891
## 10      10D 0.915
## # ℹ 990 more rows
total %>% 
   ACF(y =  Total, lag_max = 365) %>% 
   autoplot()+
   theme_minimal()+
   theme(axis.minor.ticks.length = element_blank(),
         axis.minor.ticks.theta = element_blank(),
         axis.minor.ticks.r = element_blank(),
         panel.grid.major  = element_blank())

As we can see, there is a trend (the autocorrelation decreases over time), but there are still spikes here and there, which also indicate seasonality.
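A lag-k autocorrelation is nothing more than the correlation between the series and a copy of itself shifted by k days; here is a small base-R sketch on a synthetic trending series (the series is made up purely for illustration):

```r
set.seed(42)
y <- 1:200 + rnorm(200, sd = 5)  # synthetic series: upward trend plus noise

# Correlation between y_t and y_{t-k}.
lag_cor <- function(x, k) cor(x[-seq_len(k)], x[seq_len(length(x) - k)])

lag_cor(y, 1)   # close to 1: the trend dominates
lag_cor(y, 50)  # still high, for the same reason
```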

Decomposition

We can decompose our time series into different parts, such as trend, seasonality and remainder. Here is a table after decomposing with seasonal-trend decomposition using loess (STL).
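The same idea can be tried in base R with stats::stl() on the built-in monthly co2 series, just to make the mechanics concrete (the project itself uses the STL() model from the fpp3/feasts ecosystem):

```r
# STL splits a series into seasonal + trend + remainder components.
dec   <- stl(co2, s.window = "periodic")
parts <- dec$time.series  # columns: seasonal, trend, remainder

# By construction, the three components add back up to the original series.
max(abs(rowSums(parts) - as.numeric(co2)))  # ~0 (floating point noise)
```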

decomp <- total %>% 
   model(stl = STL(ride_tr))

components(decomp) %>% 
   mutate_if(.predicate = is.numeric, .funs = ~round(., digits = 2))%>% 
   paged_table()

We can also try to describe the seasonality and trend strengths:

zone %>% 
   features(rides_day, feat_stl)
## # A tibble: 3 × 10
##   PULocation         trend_strength seasonal_strength_week seasonal_peak_week
##   <chr>                       <dbl>                  <dbl>              <dbl>
## 1 JFK Airport                 0.962                  0.606                  0
## 2 La Guardia Airport          0.953                  0.753                  0
## 3 Other                       0.975                  0.823                  5
## # ℹ 6 more variables: seasonal_trough_week <dbl>, spikiness <dbl>,
## #   linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>, stl_e_acf10 <dbl>

We can also see that for the “Other” group, the weekly seasonality is stronger than for the two airports. Here is a plot of the trend component of the model (in yellow), after transforming the data back to its original scale.

components(decomp) %>% 
   as_tsibble() %>% 
   autoplot(inv_box_cox(`ride_tr`, lambda)) +
   geom_line(aes(y = inv_box_cox((trend),lambda)), color = "yellow")

components(decomp) %>% 
   autoplot()

Forecasting

Benchmarkmodels

Before modeling our data, we should have a benchmark that our models need to surpass. For this, we create 4 different benchmark models and see which one performs best.
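The 90/10 split used below works on dates: take the first date and add 90% of the number of rows (one row per day) to get the training cutoff. A minimal sketch with a synthetic daily sequence:

```r
# 100 consecutive days; the cutoff lands 90 days after the start.
dates  <- seq(as.Date("2023-01-01"), by = "day", length.out = 100)
cutoff <- min(dates) + round(length(dates) * 0.9)

train <- dates[dates <= cutoff]
test  <- dates[dates > cutoff]
c(length(train), length(test))  # note: <= keeps the cutoff day in training
```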

set.seed(42)
n_train <- min(total$date) + round(nrow(total)* .9)
n_test <- nrow(total) - round(nrow(total)* .9)

ride_train <- total %>% 
      filter(date <= n_train)


ride_fit <- ride_train %>% 
   model(`Seas Naive` = SNAIVE(box_cox(Total, lambda)),
         `Mean` = MEAN(box_cox(Total, lambda)),
         Naive = NAIVE(box_cox(Total, lambda)),
         Drift = RW(box_cox(Total, lambda) ~ drift())
         ) 

ride_fc <- ride_fit %>% 
   forecast(h = n_test) %>% 
   filter(date <= "2024-01-31")

#ride_fc <- ride_fc %>% mutate(Total = inv_box_cox(.mean, lambda))


ride_fc %>% 
   autoplot(level =NULL, linewidth = .9, alpha = .8)+
   autolayer(filter_index(total, "2022-01-01" ~ . ))+
   scale_color_manual(values = my_palette)+
   theme_minimal()

In this case, the naive and drift models were bad, the mean is acceptable, and the seasonal naive model looks best.

accuracy(ride_fc, total)%>% 
   mutate_if(is.numeric, .funs = ~round(., digits = 2))
## # A tibble: 4 × 10
##   .model     .type      ME   RMSE    MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>      <chr>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift      Test  -23666. 28092. 23748. -511.  511. 15.2  11.2   0.99
## 2 Mean       Test   -6517.  6782.  6517. -173.  173.  4.18  2.71  0.56
## 3 Naive      Test  -23715. 28067. 23796. -511.  512. 15.3  11.2   0.99
## 4 Seas Naive Test   -2861.  4102.  3094. -119.  121.  1.99  1.64  0.83

Forecasting with 90/10 split

Untransformed Modeling

Let’s train on 90% of the data and test on the remaining 10%.

set.seed(42)
n_train <- min(total$date) + round(nrow(total)* .9)
n_test <- nrow(total) - round(nrow(total)* .9)

ride_train <- total %>% 
      filter(date <= n_train)


ride_fit <- ride_train %>% 
   model(`Seas Naive` = SNAIVE(box_cox(Total, lambda)),
         ARIMA_search = ARIMA(box_cox(Total, lambda)),
         ARIMA_spec111 = ARIMA(box_cox(Total, lambda) ~ 1 + pdq(1,1,1)),
         sARIMA = ARIMA(box_cox(Total, lambda) ~ 1 + pdq(3,1,3) + PDQ(1, 0, 0)),
         `Exp_smooth_with_damp` = ETS(box_cox(Total, lambda) ~ error("A") + trend("A", phi = 0.4) + season("N")),
         `sExp_smooth_with_damp` = ETS(box_cox(Total, lambda) ~ error("A") + trend("A", phi = 0.4) + season("A"))
         ) 

ride_fc <- ride_fit %>% 
   forecast(h = n_test) %>% 
   filter(date <= "2024-01-31")

#ride_fc <- ride_fc %>% mutate(Total = inv_box_cox(.mean, lambda))


ride_fc %>% 
   autoplot(level =NULL, linewidth = .9, alpha = .8)+
   autolayer(filter_index(total, "2022-01-01" ~ . ))+
   scale_color_manual(values = my_palette)

pdq in ARIMA stands for:
- p: autoregressive order (number of lagged observations included in the model)
- d: differencing order (number of times the data needs to be differenced to remove trends and stabilize the mean)
- q: moving average order (number of lagged forecast errors included in the model)
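To make the orders concrete, here is a small base-R illustration on synthetic data: simulate an AR(1) process and recover its coefficient with stats::arima():

```r
set.seed(42)
# Simulate an AR(1) process: y_t = 0.7 * y_{t-1} + e_t.
sim <- arima.sim(model = list(ar = 0.7), n = 1000)

fit <- arima(sim, order = c(1, 0, 0))  # p = 1, d = 0, q = 0
coef(fit)["ar1"]  # should land close to the true value of 0.7
```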

accuracy(ride_fc, total)%>% 
   mutate_if(is.numeric, .funs = ~round(., digits = 2))
## # A tibble: 6 × 10
##   .model                .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>                 <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_search          Test   -114. 2128. 1572.  -71.7  84.1  1.01  0.85  0.66
## 2 ARIMA_spec111         Test    548. 2007. 1553.  -58.9  76.9  1     0.8   0.59
## 3 Exp_smooth_with_damp  Test    168. 1981. 1488.  -65.4  79.4  0.96  0.79  0.6 
## 4 Seas Naive            Test  -2861. 4102. 3094. -119.  121.   1.99  1.64  0.83
## 5 sARIMA                Test    634. 1906. 1463.  -56.1  74.2  0.94  0.76  0.68
## 6 sExp_smooth_with_damp Test    280. 1818. 1332.  -60.4  74.4  0.86  0.73  0.76

Our ARIMA search model overfitted a bit, whereas our conservative ARIMA(1,1,1) approach did better.

So our seasonal exponential smoothing model and the seasonal ARIMA were best, but they are still poor: a mean absolute percentage error (MAPE) of 74 is very bad. This is probably due to the long period we are trying to predict. Let’s see whether this changes when we only predict the next month’s worth of taxi rides instead of 13 months.
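For reference, MAPE written out by hand (the actual/forecast vectors are made up, just to make the formula concrete):

```r
# Mean absolute percentage error: average of |error| / |actual|, in percent.
mape <- function(actual, forecast) {
  mean(abs((actual - forecast) / actual)) * 100
}

mape(actual = c(100, 200, 400), forecast = c(110, 180, 400))  # ~6.67
```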

Transformed Modeling

This uses the same pipeline, but this time with transformed taxi rides data.

set.seed(42)
n_train <- min(total$date) + round(nrow(total)* .9)
n_test <- nrow(total) - round(nrow(total)* .9)

ride_train <- total %>% 
      filter(date <= n_train)


ride_fit_tr <- ride_train %>% 
   model(`Seas Naive` = SNAIVE(ride_tr),
         ARIMA_search = ARIMA(ride_tr),
         ARIMA_spec111 = ARIMA(ride_tr ~ 1 + pdq(1,1,1)),
         sARIMA = ARIMA(ride_tr ~ 1 + pdq(3,1,3) + PDQ(1, 0, 0)),
         `Exp_smooth_with_damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("N")),
         `sExp_smooth_with_damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("A"))
         ) 

ride_fc_tr <- ride_fit_tr %>% 
   forecast(h = n_test) %>% 
   filter(date <= "2024-01-31")

#ride_fc <- ride_fc %>% mutate(Total = inv_box_cox(.mean, lambda))


ride_fc_tr %>% 
   autoplot(level =NULL, linewidth = .9, alpha = .8)+
   autolayer(filter_index(total, "2022-10-01" ~ . ), .vars = ride_tr)+
   scale_color_manual(values = my_palette)

Please bear in mind that our dataset spans several years and we only forecast 405 days, so the plot above only shows a snippet of the actual data (the relevant timeframe). The data is still in transformed form.

We can also quantify the forecasting quality by computing and comparing accuracy metrics:

accuracy(ride_fc_tr, total) %>% 
   mutate_if(is.numeric, .funs = ~round(., digits = 2))
## # A tibble: 6 × 10
##   .model                .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>                 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_search          Test   1.59  3.49  2.4   2.3   7.78  1.93  1.63  0.65
## 2 ARIMA_spec111         Test   2.01  3.7   2.7   3.43  8.49  2.16  1.73  0.66
## 3 Exp_smooth_with_damp  Test   1.61  3.5   2.42  2.35  7.82  1.94  1.64  0.65
## 4 Seas Naive            Test   1.01  3.21  2.03  0.8   6.88  1.63  1.51  0.67
## 5 sARIMA                Test   1.95  3.6   2.61  3.31  8.26  2.09  1.68  0.71
## 6 sExp_smooth_with_damp Test   2.23  3.66  2.8   4.11  8.73  2.25  1.72  0.73

For the transformed data, the seasonal naive model (our benchmark) is now best. Probably because the transformed data does not have much variance, a “straight line” works well. What about a shorter time period?

Forecasting for 90 days

set.seed(42)
n_train <- max(total$date) - 90
n_test <- 90

ride_train <- total %>% 
      filter(date <= n_train)


ride_fit <- ride_train %>% 
   model(`Seas Naive` = SNAIVE(ride_tr),
         #ARIMA_search = ARIMA(ride_tr),
         #ARIMA_spec111 = ARIMA(ride_tr ~ 1 + pdq(1,1,1)),
         sARIMA = ARIMA(ride_tr ~ 1 + pdq(3,1,3) + PDQ(1, 0, 0)),
         `Exp damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("N")),
         `sExp damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("A"))
         ) 

ride_fc <- ride_fit %>% 
   forecast(h = n_test) %>% 
   filter(date <= "2024-01-31")

#ride_fc <- ride_fc %>% mutate(Total = inv_box_cox(.mean, lambda))


ride_fc %>% 
   autoplot(level =NULL, linewidth = 1.2, alpha = .9)+
   autolayer(filter_index(total, paste(n_train-2) ~ . ), .vars = ride_tr, linewidth = 1.3)+
   scale_color_manual(values = my_palette)+
   theme_minimal()+
   theme(legend.text = element_text(size = 10),
         legend.title = element_blank(),
         legend.position = "top",
         legend.direction = "horizontal",
         axis.text.x = element_text(size = 10))+
   guides(color = guide_legend(override.aes = list(size = 8, linewidth = 5)))+
   labs(y = "Transformed Taxi Rides")

accuracy(ride_fc, total) %>% 
   mutate_if(is.numeric, .funs = ~round(., digits = 2))
## # A tibble: 4 × 10
##   .model     .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Exp damp   Test  -0.77  2.21  1.69 -2.37  4.64  1.32  0.93  0.52
## 2 Seas Naive Test  -0.79  2.42  1.85 -2.39  5.04  1.45  1.02  0.52
## 3 sARIMA     Test  -0.82  2.14  1.63 -2.48  4.5   1.28  0.9   0.59
## 4 sExp damp  Test  -0.52  2.04  1.53 -1.62  4.17  1.2   0.86  0.62

Now the seasonal exponential smoothing model was best, followed by the seasonal ARIMA model and then the damped exponential smoothing model. All of them fared better than the benchmark model.

But what about the residuals? Do these models fit the data equally well for all time points?

aug <- augment(ride_fit)

aug %>% 
   autoplot(.innov)+
   theme_minimal()+
   facet_grid(rows = vars(.model))

aug %>% 
   ggplot(aes(x = .innov))+
   geom_histogram()+
   facet_grid(rows = vars(.model))+
   theme_minimal()

aug %>% 
   ACF(.innov) %>% 
   autoplot()

All of the residual plots above indicate that the models are fairly fine; some of them probably even fit the data too well. Bear in mind these are only the training data, so small residuals are expected.

Let’s plot all of the residual plots for our best model:

ride_train %>% 
   model(`sExp damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("A"))) %>% 
   gg_tsresiduals()

Forecasting by segmenting by keys

So far, we’ve only forecasted the total number of rides in Queens. But as stated in the introduction, we can also fit a model per zone and then combine them. Let us see whether we get a good forecast.

set.seed(42)
n_train <- max(total$date) - 90
n_test <- 90


zone_train <- zone %>% 
   group_by(PULocation) %>% 
      filter(date <= n_train)

zone_fit <- zone_train %>% 
   model(
      `Seas Naive` = SNAIVE(rides_day),
      `sExp damp` = ETS(rides_day ~ error("A") + trend("A", phi = 0.1) + season("A"))
   ) 


zone_fc <- zone_fit %>% 
   forecast(h = n_test)

combined_zone_fc <- zone_fc %>%
   as_tsibble() %>% 
   group_by(.model) %>%
   index_by(date) %>%
   summarize(tot_pred = sum(.mean), .groups = "drop") %>%
   as_tsibble()


combined_zone_fc %>% 
   autoplot(tot_pred,  level = NULL, linewidth = 1.2, alpha = .8)+
   autolayer(filter_index(total, paste(n_train-2) ~. ), .vars = Total, linewidth = 1.3)+
   theme_minimal()+
   scale_color_manual(values = my_palette)

Time Series Regression

We can also use time series data in a linear regression. As always, the higher the correlations between the predictors and the outcome, the better.

First of all, let’s see whether we can use our weather and lockdown data to “describe” the data with a linear regression.
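At its core, TSLM() fits an ordinary least-squares regression on predictors observed at the same time points; a base-R lm() sketch on made-up temperature data shows the idea:

```r
set.seed(42)
temp  <- runif(200, min = 0, max = 30)            # hypothetical daily max temperature
rides <- 5000 + 60 * temp + rnorm(200, sd = 300)  # hypothetical linear relationship

fit <- lm(rides ~ temp)
summary(fit)$r.squared  # high here, because the data were generated linearly
```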

total %>% 
   left_join(full_ride2, by = "date") %>% 
   model(tslm_weather_lockdown = TSLM(Total ~ Max_Temp_C + lock_down + Max_rain_mm),
         tslm_weather = TSLM(Total ~ Max_Temp_C + Max_rain_mm)) %>% 
   report()
## # A tibble: 2 × 15
##   .model r_squared adj_r_squared sigma2 statistic   p_value    df log_lik    AIC
##   <chr>      <dbl>         <dbl>  <dbl>     <dbl>     <dbl> <int>   <dbl>  <dbl>
## 1 tslm_…   0.246         0.246   4.20e7    441.   1.19e-247     4 -41272. 71066.
## 2 tslm_…   0.00258       0.00208 5.56e7      5.23 5.40e-  3     3 -41839. 72198.
## # ℹ 6 more variables: AICc <dbl>, BIC <dbl>, CV <dbl>, deviance <dbl>,
## #   df.residual <int>, rank <int>

While the result may seem decent, it does not generalize well, as most of the predictive power comes from the lockdown variable, which captures a rare event and is only helpful during a pandemic. Do not overestimate this finding, as this variable is useless outside of the pandemic.

Looking only at the R-squared value, our weather data does not seem very useful (even though, according to the literature, there should be a large effect). One possible reason is that our data lacks granularity (e.g., with hourly data, weather variables such as rain might be predictive).

But what about using the taxi rides from one airport to predict the number of taxi rides at the other?

zone %>% 
   pivot_wider(names_from = PULocation,
               values_from = rides_day) %>% 
   rowwise() %>% 
   mutate(Total = sum(`JFK Airport`, `La Guardia Airport`, Other)) %>% 
   as_tsibble(index = date) %>% 
   model(LG_JFK = TSLM(`La Guardia Airport` ~ `JFK Airport`),
         LG_JFK_weather = TSLM(`La Guardia Airport` ~ `JFK Airport` + Max_Temp_C + Max_rain_mm),
         LG_JFK_weather_lockdown = TSLM(`La Guardia Airport` ~ `JFK Airport` + Max_Temp_C + Max_rain_mm + lock_down),
         LG_weather_only = TSLM(`La Guardia Airport` ~ Max_Temp_C + Max_rain_mm)) %>% 
   report()
## # A tibble: 4 × 15
##   .model   r_squared adj_r_squared sigma2 statistic p_value    df log_lik    AIC
##   <chr>        <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
## 1 LG_JFK     0.740         0.740   3.55e6  11529.   0           2 -36272. 61062.
## 2 LG_JFK_…   0.744         0.744   3.50e6   3919.   0           4 -36241. 61005.
## 3 LG_JFK_…   0.756         0.756   3.34e6   3129.   0           5 -36146. 60816.
## 4 LG_weat…   0.00342       0.00293 1.36e7      6.95 9.71e-4     3 -38993. 66506.
## # ℹ 6 more variables: AICc <dbl>, BIC <dbl>, CV <dbl>, deviance <dbl>,
## #   df.residual <int>, rank <int>

We can see that this approach yields a very good model. Adding weather and lockdown data helped only marginally, as these factors are already reflected in the taxi ride counts themselves.

Discussion

Limitations

  • Weather data and its reliability (e.g., only the maximum rain in mm and the highest temperature, measured at a single station).
  • The COVID-19 measure coding was somewhat arbitrary, and the source is a Wikipedia article.
  • While forecasting is basically time series analysis, prediction-based methods also need future values of the predictor variables. In our case, it is very hard to get accurate weather data far into the future (e.g., 1 week or longer).
  • Even though I used prediction as well, in a “real world” scenario where I am tasked with predicting the total number of taxi rides within a given timeframe, this data would not be available.
  • The categorization into 3 zones depends on the PICKUP location (the starting location) only, nothing else (e.g., not the drop-off location).

Credits and Acknowledgements

References

  • Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 26(2), 211–252. https://doi.org/10.1111/j.2517-6161.1964.tb00553.x

  • Cleveland, R. B., Cleveland, W. S., McRae, J. E., & Terpenning, I. J. (1990). STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1), 3–33. http://bit.ly/stl1990

  • Green, K. C., Graefe, A., & Armstrong, J. S. (2011). Forecasting principles. In M. Lovric (Ed.), International Encyclopedia of Statistical Science. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04898-2_257